Exploratory Data Analysis (EDA)¶

Project #2 PoC¶

Project name: E-commerce products¶

Team name: MOP¶


We focus on the Web Data Commons - Training Dataset and Gold Standard for Large-Scale Product Matching dataset (WDC for short), prepared by the staff of the University of Mannheim (Primpeli et al., 2019).

Additionally, each offer is linked to a specific product (cluster id) and contains textual attributes such as title and description.

The training datasets are available in different sizes, from small to extra large. In every dataset, the ratio between positive and negative pairs is 1:3.
The relative sizes of the collections within one category are as follows:

  • 1 – small,
  • 3 – medium,
  • 15 – large,
  • 50 – extra large (xlarge).

The Gold Standard is verified manually and should be used for testing purposes.
Because our time and computational resources are limited, we focus on the large size.

In [ ]:
import pandas as pd
from eda_source import *

Reading the data¶

The data is provided as four files, one per category: Computers, Cameras, Watches, and Shoes. This split is convenient: we plan to train a separate model for each category, so no additional merging is required.

In [ ]:
# Training data
df_computers = pd.read_json(path_or_buf='./data/computers_train_medium.json', lines=True)
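The training files are in JSON Lines format (one JSON object per line), hence `lines=True`. A minimal, self-contained sketch of the same parsing on a tiny synthetic two-pair sample (the values below are made up):

```python
import io

import pandas as pd

# A tiny synthetic sample in the same JSON Lines layout:
# one JSON object per line, one offer pair per object.
sample = io.StringIO(
    '{"id_left": 1, "id_right": 2, "label": 1, "pair_id": "1#2"}\n'
    '{"id_left": 3, "id_right": 4, "label": 0, "pair_id": "3#4"}\n'
)

df_sample = pd.read_json(sample, lines=True)
print(df_sample.shape)  # (2, 4)
```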

Feature and observation meaning¶

Each observation is a pair of such offers and a label indicating whether these two offers are for the same product (a positive pair) or not (a negative pair). Even in the case of a negative pair, both offers belong to the same category (but different clusters/products).

In [ ]:
print(f"In all categories there are the same features: \n {df_computers.columns.values}")
In all categories there are the same features: 
 ['id_left' 'title_left' 'description_left' 'brand_left' 'price_left'
 'specTableContent_left' 'keyValuePairs_left' 'category_left'
 'cluster_id_left' 'identifiers_left' 'id_right' 'title_right'
 'description_right' 'brand_right' 'price_right' 'specTableContent_right'
 'keyValuePairs_right' 'category_right' 'cluster_id_right'
 'identifiers_right' 'label' 'pair_id']

Feature meaning:

  • id: The unique integer identifier of an offer.
  • cluster_id: The integer ID of the cluster (product) to which an offer belongs.
  • identifiers: A list of all identifier values assigned to an offer, together with the schema.org terms used to annotate them.
  • category: One of 25 product categories the product was assigned to; NaN if not part of the English subset.
  • title: The product title.
  • description: The product description.
  • brand: The product brand.
  • price: The product price.
  • specTableContent: The content of the specification table on the product's website, as one string.
  • keyValuePairs: The key-value pairs that the dataset authors extracted from the specification tables.


Note: the 'left' suffix denotes the first offer in a pair (its id opens the pair_id), and the 'right' suffix denotes the second one.
A positive pair: cluster_id_left == cluster_id_right
A negative pair: cluster_id_left != cluster_id_right
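This relationship between the label and the cluster ids can be cross-checked directly; a minimal sketch on a synthetic pair table (column names taken from the dataset, values made up):

```python
import pandas as pd

# Synthetic pairs: label should equal 1 exactly when both offers
# share a cluster id (the values below are made up).
pairs = pd.DataFrame({
    "cluster_id_left":  [1679624, 109916, 109916],
    "cluster_id_right": [1679624, 109916, 42],
    "label":            [1, 1, 0],
})

same_cluster = pairs["cluster_id_left"] == pairs["cluster_id_right"]
consistent = same_cluster == (pairs["label"] == 1)
print(consistent.all())  # True
```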


Computers¶

In [ ]:
df_computers.head()
Out[ ]:
id_left title_left description_left brand_left price_left specTableContent_left keyValuePairs_left category_left cluster_id_left identifiers_left ... description_right brand_right price_right specTableContent_right keyValuePairs_right category_right cluster_id_right identifiers_right label pair_id
0 16876009 495906 b21 hp x5560 2 80ghz ml350 g6 , null ne... description intel xeon x5560 ml350 g6 2 80ghz ... hp enterprise None specifications category proliant processor sub... {'category': 'proliant processor', 'sub catego... Computers_and_Accessories 1679624 [{'/sku': '[495906b21]'}, {'/mpn': '[495906b21... ... description intel xeon x5560 ml350 g6 2 80ghz ... hp enterprise usd 213 85 specifications category proliant processor sub... {'category': 'proliant processor', 'sub catego... Computers_and_Accessories 1679624 [{'/mpn': '[495906b21]'}] 1 16876009#16248399
1 5490217 hp intel xeon x5560 prijzen tweakers None None None categorie processors merk hp product hp intel ... {'categorie': 'processors', 'merk': 'hp', 'pro... Computers_and_Accessories 1679624 [{'/mpn': '[495906b21]'}, {'/gtin13': '[884420... ... description intel xeon x5560 ml350 g6 2 80ghz ... hp enterprise usd 213 85 specifications category proliant processor sub... {'category': 'proliant processor', 'sub catego... Computers_and_Accessories 1679624 [{'/mpn': '[495906b21]'}] 1 5490217#16248399
2 16543750 asus motherboard lga2066 ddr4 m 2 u atx 2xgbe ... None None None None None Computers_and_Accessories 109916 [{'/sku': '[34181655]'}, {'/mpn': '[primex299d... ... None None None categorie moederborden merk asus product asus ... {'categorie': 'moederborden', 'merk': 'asus', ... Computers_and_Accessories 109916 [{'/mpn': '[primex299deluxe, 90mb0ty0m0eay0]'... 1 16543750#16721450
3 16721450 asus prime x299 deluxe prijzen tweakers None None None categorie moederborden merk asus product asus ... {'categorie': 'moederborden', 'merk': 'asus', ... Computers_and_Accessories 109916 [{'/mpn': '[primex299deluxe, 90mb0ty0m0eay0]'... ... support for x series intel core processors sli... asus None None None Computers_and_Accessories 109916 [{'/mpn': '[90mb0ty0m0eay0]'}, {'/gtin13': '[4... 1 16721450#10358026
4 10358026 asus prime x299 deluxe socket 2066 intel atx m... support for x series intel core processors sli... asus None None None Computers_and_Accessories 109916 [{'/mpn': '[90mb0ty0m0eay0]'}, {'/gtin13': '[4... ... atx quad channel ddr4 3 x pcie 3 0 x16 2 x m 2... asus None None None Computers_and_Accessories 109916 [{'/productID': '[asux29del]'}, {'/mpn': '[pri... 1 10358026#4588573

5 rows × 22 columns

In [ ]:
n = df_computers.shape[0]
m = df_computers.shape[1]

print(f'Number of pairs: {n}, number of features: {m}')
Number of pairs: 8094, number of features: 22
In [ ]:
print(f'Number of positive pairs: {get_positive_pairs_count(df_computers)} - {round(get_positive_pairs_count(df_computers) * 100 / n, 2)}% of all observations.')
print(f'Number of negative pairs: {get_negative_pairs_count(df_computers)} - {round(get_negative_pairs_count(df_computers) * 100 / n, 2)}% of all observations.')
Number of positive pairs: 6332 - 78.23% of all observations.
Number of negative pairs: 1762 - 21.77% of all observations.
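`get_positive_pairs_count` and `get_negative_pairs_count` are imported from `eda_source` (not shown here); a minimal equivalent, assuming the label column is 1 for positive pairs and 0 for negative ones:

```python
import pandas as pd

def get_positive_pairs_count(df: pd.DataFrame) -> int:
    # A positive pair is marked with label == 1.
    return int((df["label"] == 1).sum())

def get_negative_pairs_count(df: pd.DataFrame) -> int:
    # A negative pair is marked with label == 0.
    return int((df["label"] == 0).sum())

# Tiny synthetic check (values made up).
toy = pd.DataFrame({"label": [1, 1, 0, 1]})
print(get_positive_pairs_count(toy), get_negative_pairs_count(toy))  # 3 1
```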
In [ ]:
plot_positive_vs_negative(df_computers)

Missing values (left and right separately)¶

We plan to concatenate the textual features into one string as the model input, so missing values are manageable.
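Such a concatenation, skipping missing values, could look like this (a sketch; the exact column selection and separator are still an assumption at this point):

```python
import pandas as pd

# Assumed selection of textual attributes to concatenate.
TEXT_COLS = ["title", "description", "brand", "specTableContent"]

def offer_to_text(row: pd.Series, side: str) -> str:
    # Join the non-missing textual attributes of one offer ("left" or
    # "right") into a single input string.
    parts = [
        str(row[f"{col}_{side}"])
        for col in TEXT_COLS
        if pd.notna(row[f"{col}_{side}"])
    ]
    return " ".join(parts)

# Synthetic offer with a missing description and spec table (values made up).
row = pd.Series({
    "title_left": "asus prime x299 deluxe",
    "description_left": None,
    "brand_left": "asus",
    "specTableContent_left": None,
})
print(offer_to_text(row, "left"))  # asus prime x299 deluxe asus
```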

In [ ]:
missing_values = {}
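`print_missing_values` comes from `eda_source`; a minimal equivalent, assuming it counts NaNs in one column, stores the count for later plotting, and prints the share:

```python
import pandas as pd

def print_missing_values(df: pd.DataFrame, col: str, missing: dict) -> None:
    # Count NaN/None entries in the column and store the count
    # so it can be plotted afterwards.
    count = int(df[col].isna().sum())
    missing[col] = count
    pct = round(count * 100 / df.shape[0], 2)
    print(f"Number of missing values in the column {col}: {count}, "
          f"which is {pct}% of all pairs")

# Tiny synthetic check (values made up).
toy = pd.DataFrame({"brand_left": ["hp", None, "asus", None]})
stats = {}
print_missing_values(toy, "brand_left", stats)
print(stats)  # {'brand_left': 2}
```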

Title¶

In [ ]:
col = 'title_left'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: title_left = 0, which is 0.0% of all pairs
In [ ]:
col = 'title_right'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: title_right = 0, which is 0.0% of all pairs

Description¶

In [ ]:
col = 'description_left'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: description_left = 2442, which is 30.17% of all pairs
In [ ]:
col = 'description_right'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: description_right = 2461, which is 30.41% of all pairs

Brand¶

In [ ]:
col = 'brand_left'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: brand_left = 3948, which is 48.78% of all pairs
In [ ]:
col = 'brand_right'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: brand_right = 3996, which is 49.37% of all pairs

Price¶

In [ ]:
col = 'price_left'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: price_left = 6520, which is 80.55% of all pairs
In [ ]:
col = 'price_right'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: price_right = 6653, which is 82.2% of all pairs

specTableContent¶

In [ ]:
col = 'specTableContent_left'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: specTableContent_left = 5303, which is 65.52% of all pairs
In [ ]:
col = 'specTableContent_right'
print_missing_values(df_computers, col, missing_values)
Number of missing values in the column: specTableContent_right = 5256, which is 64.94% of all pairs
In [ ]:
plot_missing_values(missing_values, n)

Missing values (right and left simultaneously)¶

Number of values missing in both the left and the right offer simultaneously, per feature.
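`print_missing_values_simultaneously` also comes from `eda_source`; a minimal equivalent of its counting step, assuming it counts rows where a feature is missing on both sides of the pair:

```python
import pandas as pd

def count_missing_simultaneously(df: pd.DataFrame, col_prefix: str) -> int:
    # Rows where the feature is missing in BOTH the left and the right offer.
    both = df[[f"{col_prefix}_left", f"{col_prefix}_right"]].isna().all(axis=1)
    return int(both.sum())

# Tiny synthetic check (values made up): only the first row
# is missing the price on both sides.
toy = pd.DataFrame({
    "price_left":  [None, 10.0, None],
    "price_right": [None, None, 5.0],
})
print(count_missing_simultaneously(toy, "price"))  # 1
```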

In [ ]:
missing_values_simul = {}
In [ ]:
col_prefix = 'title'
print_missing_values_simultaneously(df_computers, col_prefix, missing_values_simul)
Number of missing values in the columns (both title_right and title_left) = 0, which is 0.0% of all pairs
In [ ]:
col_prefix = 'description'
print_missing_values_simultaneously(df_computers, col_prefix, missing_values_simul)
Number of missing values in the columns (both description_right and description_left) = 954, which is 11.79% of all pairs
In [ ]:
col_prefix = 'brand'
print_missing_values_simultaneously(df_computers, col_prefix, missing_values_simul)
Number of missing values in the columns (both brand_right and brand_left) = 2575, which is 31.81% of all pairs
In [ ]:
col_prefix = 'price'
print_missing_values_simultaneously(df_computers, col_prefix, missing_values_simul)
Number of missing values in the columns (both price_right and price_left) = 5509, which is 68.06% of all pairs
In [ ]:
col_prefix = 'specTableContent'
print_missing_values_simultaneously(df_computers, col_prefix, missing_values_simul)
Number of missing values in the columns (both specTableContent_right and specTableContent_left) = 4074, which is 50.33% of all pairs
In [ ]:
plot_missing_values_simultaneously(missing_values_simul, n)

Average length of title and description¶

In [ ]:
avg_lens = {}
avg_lens_pos = {}
avg_lens_neg = {}
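`print_avg_lengths` is another `eda_source` helper; a minimal equivalent of its computation, assuming it reports the mean character length of a text column for all, positive, and negative pairs (NaN entries are ignored by `.mean()`):

```python
import pandas as pd

def avg_length(df: pd.DataFrame, col: str) -> tuple:
    # Mean character length of a text column: overall, for positive
    # pairs (label == 1), and for negative pairs (label == 0).
    lengths = df[col].str.len()
    return (
        lengths.mean(),
        lengths[df["label"] == 1].mean(),
        lengths[df["label"] == 0].mean(),
    )

# Tiny synthetic check (values made up).
toy = pd.DataFrame({
    "title_left": ["abcd", "ab", "abcdef"],
    "label":      [1, 1, 0],
})
print(avg_length(toy, "title_left"))  # (4.0, 3.0, 6.0)
```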
In [ ]:
col = 'title_right'
print_avg_lengths(df_computers, col, avg_lens, avg_lens_pos, avg_lens_neg)
The average length of the column title_right - all pairs = 69.35, pos. pairs = 69.34, neg. pairs = 69.38
In [ ]:
col = 'title_left'
print_avg_lengths(df_computers, col, avg_lens, avg_lens_pos, avg_lens_neg)
The average length of the column title_left - all pairs = 68.44, pos. pairs = 68.50, neg. pairs = 68.20
In [ ]:
col = 'description_left'
print_avg_lengths(df_computers, col, avg_lens, avg_lens_pos, avg_lens_neg)
The average length of the column description_left - all pairs = 241.87, pos. pairs = 242.86, neg. pairs = 238.29
In [ ]:
col = 'description_right'
print_avg_lengths(df_computers, col, avg_lens, avg_lens_pos, avg_lens_neg)
The average length of the column description_right - all pairs = 215.88, pos. pairs = 216.29, neg. pairs = 214.42
In [ ]:
plot_avg_lengths(avg_lens, avg_lens_pos, avg_lens_neg, n)
In [ ]:
import plotly
# Preserve Plotly interactivity when the notebook is exported to HTML
plotly.offline.init_notebook_mode()